INTRODUCTION

The  purpose  of  this  paper  is  to  describe  a  model  for  statistically  analyzing 
software  error  detection  and  correction  processes  during  software  functional 
testing.   The  purpose  of  the  model  is  to  provide  decision  aids  for  controlling 
the  quality  of  command  and  control  system  software.   The  inputs  to  the  model 
are  error  detection  histories  and  the  outputs  are  forecasts  of  the  future 
behavior  of  error  detection  and  correction  processes.   The  model  outputs  would 
provide  software  production  and  quality  control  management  with  quantitative 
guidelines  for: 

•  establishing  testing  strategies, 

•  making  the  acceptance/rejection  decision, 

•  evaluating  the  tradeoff  between  incremental  quality  improvement 
and  incremental  resource  investment. 

SCOPE 

The  scope  of  this  paper  is  to  define  certain  software  error  terminology,  set 
forth  the  mathematical  formulation  of  the  model  and  focus  in  detail  on  the 
methodology  used  for  error  detection  and  correction  forecasting.   With  respect 
to  the  last  item,  forecasted  values  are  compared  with  actual  counts  of  detected 
and  corrected  errors  obtained  from  a  limited  number  of  Navy  Tactical  Data 
System  (NTDS)  software  trouble  reports.   Several  variations  of  the  forecasting 
methodology  are  compared  and  tentative  conclusions  are  drawn  concerning  the 
validity  of  the  methodology.   The  validation  of  the  forecasting  methodology 
against  a  large  amount  of  NTDS  software  error  data  was  beyond  the  scope  of 
this  particular  research  effort. 


TERMINOLOGY 

Increasingly,  the  quantitative  evaluation  of  computer  software  is  recognized 
as  critically  important  to  the  effective  functioning  of  computer  systems. 
The  body  of  knowledge,  literature  and  models  concerned  with  the  measurement 
and  evaluation  of  software  characteristics  is  growing  [1,  2,  3,  4,  5,  6]. 
Despite  this  upsurge  in  activity,  there  do  not  exist  accepted  definitions  and 
methodology for describing and analyzing software error measurements. Therefore,
it is inappropriate to use such terms as "software error," "software
quality  control"  or  "software  quality  assurance"  with  the  expectation  of 
achieving  uniformity  of  interpretation  and  acceptance.   Consequently,  certain 
terms  will  be  defined  as  they  apply  to  this  paper. 

Software  Error.   Here  we  are  concerned  with  errors  in  programming  logic 
which  lead  to  undesirable  results  during  program  execution.   Examples 
are  the  storing  of  data  in  incorrect  memory  locations  or  accessing  the 
wrong  file  on  a  disc  unit.   Deviations  from  performance  which  are  caused 
by  an  inherent  limitation  of  the  numerical  or  algorithmic  technique  are 
called  performance  deficiencies  and  are  not  classified  as  software  errors. 
An  example  is  the  lack  of  precision  in  a  computation  which  results  from  the 
truncation  property  of  the  algorithm,  or  an  insufficient  word  length  in 
the  hardware.   Also,  errors  or  failures  which  are  strictly  attributable 
to  hardware  are  not  counted  as  software  errors. 

Software  Quality.   This  is  the  propensity  of  errors  to  occur  in  a  program 
under  stated  operating  conditions,  as  determined  by  such  measures  as  the 
number of errors per unit time, time between errors, errors per instruction
executed and criticality of certain errors to mission success.


Software  Quality  Control.   Management  systems  and  procedures  which  are 
employed  during  program  production  and  testing  to  produce  software  with 
acceptable  error  characteristics. 

Functional  Testing.   A  type  of  software  testing  designed  to  ensure  the 
user  that  system  functions,  such  as  target  tracking,  can  be  executed 
correctly.   This  type  of  testing  is  normally  conducted  by  the  user  after 
the  individual  modules  have  been  debugged  by  programmers  and  the  modules 
have  been  integrated  together  as  a  program. 

Software  Quality  Assurance.   Criteria  for  user  acceptance  or  rejection 
of software products and the user acceptance tests (tests of user functions)
and procedures used during functional testing.

Module. A set of computer instructions which accomplishes some specific
function, such as the tracking or display function in NTDS.

Program. In the context of this paper, a program is a set of modules
designed to carry out an operational mission, such as the computer programs
used in NTDS aboard an aircraft carrier.

APPROACH 

Ideally, the first step in software quality management would be the establishment
of quality specifications which would be used during the testing period
to  determine  the  acceptability  of  the  software  product  for  its  intended  use. 
Unfortunately,  the  use  of  software  specifications  which  pertain  to  error,  as 
opposed  to  performance  characteristics,  is  not  widespread.   Since  there  is 
normally  no  baseline  against  which  test  results  can  be  compared,  a  problem 
arises  concerning  the  choice  of  criteria  for  determining  the  acceptability  of 
a software product. The approach used in this model is to monitor the occurrence
of software errors and to forecast future numbers of cumulative detected and
corrected errors in order to:

(1)  Identify  the  trade-off  function  between  error  reduction  and  the  cost 
of  error  reduction,  where  cost  may  be  measured  in  terms  of  calendar 
time,  computer  time, or  manpower  required. 

(2)  Provide  a  quantitative  basis  for  accepting  or  rejecting  software 
during  functional  testing. 

(3)  Provide  a  quantitative  basis  for  deciding  whether  additional  testing 
is  warranted  based  on  the  relationship  between  incremental  error 
reduction  and  incremental  cost. 

Quantitative  measures  of  software  quality  which  can  be  applied  to  the  above  are: 

•  cumulative  number  of  detected  or  corrected  errors  as  a  function  of  time, 

•  rate  of  error  detection  or  correction;  this  is  the  first  derivative 
of  the  preceding  function, 

• number of errors that have been detected but not corrected after a
specified time,

•  time  required  to  detect  a  specified  number  of  cumulative  errors, 

•  time  required  to  correct  a  specified  number  of  cumulative  detected 
errors, 

•  time  required  to  correct  a  specified  number  of  detected  but  uncorrected 
errors. 

In  the  above  measures,  the  variable  time  can  be  either  calendar  time 
or software test time. Also, the measures apply to both historical and forecasted
error values.
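
The measures listed above can be computed directly from interval counts. The following is a minimal Python sketch, assuming that weekly counts of detected and corrected errors are available as lists. The function names and the sample counts are hypothetical and are not taken from the NTDS data. For interval data, the rate of detection or correction is simply the count per interval, i.e. the first difference of the cumulative count.

```python
from itertools import accumulate


def quality_measures(detected_per_week, corrected_per_week):
    """Return cumulative detected, cumulative corrected, and uncorrected counts."""
    cum_detected = list(accumulate(detected_per_week))
    cum_corrected = list(accumulate(corrected_per_week))
    # Errors detected but not yet corrected at the end of each interval.
    backlog = [d - c for d, c in zip(cum_detected, cum_corrected)]
    return cum_detected, cum_corrected, backlog


def time_to_reach(cumulative, target):
    """First interval (1-based) at which the cumulative count reaches target."""
    for week, total in enumerate(cumulative, start=1):
        if total >= target:
            return week
    return None  # target not reached within the observed record


# Hypothetical weekly counts, for illustration only.
detected = [5, 4, 4, 2, 1]
corrected = [0, 3, 4, 3, 2]
cum_d, cum_c, open_errors = quality_measures(detected, corrected)
weeks_to_ten = time_to_reach(cum_d, 10)
```

The same functions apply unchanged to forecasted counts, since the measures are defined for both historical and forecasted values.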

Unlike  hardware  which  wears  out  or  deteriorates  with  time,  software 
should improve with time as more of the latent errors are detected and corrected.
However, there are exceptions to this general characteristic. When an error
is removed, it is possible that one or more new errors will be introduced.
Also, most operational software is subject to modification as the result of
application changes or design improvements. Consequently, the time series
of  error  counts  over  equal  time  intervals  will  not  necessarily  be  monotonically 
decreasing,  and  the  time  between  error  occurrences  will  not  necessarily  be 
monotonically  increasing,  over  the  life  cycle  of  the  software.   However,  the 
trend  of  these  series  will  be  decreasing  and  increasing,  respectively,  when 
observed  over  an  extended  time  period  [6]. 

It  is  important  to  recognize  that,  with  exceptions  occurring  in  trivial 
programs,  software  is  seldom  free  of  errors.   Errors  may  reside  undetected  in 
software  for  many  years  until  a  particular  set  of  input  data  causes  a  previously 
untraversed  module  path  to  be  executed  [4]. 

APPLICABILITY 

Software  Testing. 

This  model  is  applicable  to  the  analysis  of  software  errors  which  are 
detected  during  the  functional  test  phase  of  software  testing.   It  is  not 
applicable  to  the  detailed  and  highly  individualistic  test  procedures  employed 
by  the  programmer  prior  to  the  conduct  of  functional  tests.   In  particular, 
we  are  concerned  with  tests  which  are  made  after  the  individual  modules  are 
linked  together  as  a  program.   The  reason  for  the  distinction  is  that  programmer 
debugging  procedures  are  highly  individualized  [1].   Also,  the  selection  of 
test  procedures  and  test  data  may  be  a  deterministic  process  and,  in  general, 
be less suitable for a probabilistic analysis than functional testing. Functional
testing is designed to test the ability of the program to produce desired
outputs with a given sequence of inputs. Inputs may consist of console
operator  actions;  inputs  from  radar,  magnetic  tape  or  other  sensors  and 
devices;  or  a  script  of  inputs  recorded  on  magnetic  tape  which  simulates 
console  operator  input  actions.   Any  input  errors  are  not  counted  as  software 
errors.

Software Errors.

Software  errors  can  be  classified  according  to  the  programming  and 
hardware  characteristics  of  the  error,  such  as  an  incorrect  memory-to-memory 
transfer caused by an error in addressing. In addition, errors can be classified
according to their effect on mission success. Both types of classification
are useful. However, the latter classification is much more difficult
to  make  than  the  former  because  it  is  necessary  to  associate  three  items  of 
information: 

(1) the manifestation of the error (incorrect target symbol on the
display console);

(2)  the  effect  on  the  mission  (failure  to  identify  a  hostile  target);  and 

(3)  the  programming  cause  of  the  error  (incorrect  interpretation  by  the 
program  of  an  operator  hostile  target  input). 

Although  the  variability  among  error  counts  which  are  used  to  measure 
software  quality  could  probably  be  decreased  by  classifying  and  counting  errors 
by category, the resulting sample sizes would be significantly reduced. Consequently,
as a practical matter, only a limited number of categories can be
used  for  error  classification.   In  NTDS  software  error  reporting,  errors  are 
classified  according  to  the  effect  on  the  mission  as  follows: 

• High. The computer will stop if this type of error occurs. Example:
attempt to address data outside the memory address range of the computer.

• Medium. This type of error will cause a degradation in system
performance. Example: target position is not updated with sufficient
frequency.

•  Low.   This  type  of  error  will  be  distracting  or  annoying  but  will  not 
normally  result  in  degraded  performance,  although  lowered  performance 
could  result  if  the  operator  is  unable  to  cope  with  the  problem. 
Example:   the  programmed  refresh  rate  of  the  display  console  is  low 
and  causes  fading  of  symbol  displays. 

In  order  to  achieve  maximum  sample  size  for  evaluating  the  forecasting 
methodology  used  in  the  model,  errors  were  not  separated  according  to  the 
classification  given  above.   However,  when  errors  are  separated  by  category, 
the  same  forecasting  methodology  is  used  for  each  forecast.   Forecasting 
accuracy  will  be  affected  by  categorization  because  sample  size  and  variability 
among  error  counts  are  reduced. 

ASSUMPTIONS 

It  is  assumed  that  the  number  of  errors  which  is  detected  during  a  time 
interval  and  the  collection  of  error  counts  over  a  series  of  time  intervals 
are  modelled  by  a  random  variable  and  a  stochastic  process,  respectively. 
Errors  which  are  repeated  as  a  result  of  repeating  the  execution  of  the  module 
under  identical  test  conditions,  and  without  having  corrected  the  error,  are 
not  counted.   Only  "new"  errors,  which  occur  for  the  first  time  (execution), 
are  counted. 

Functional  testing  is  not  entirely  probabilistic  because  test  plans 
and  procedures  are  structured  to  a  certain  extent  and  are  not  selected  at 
random. Although the types of tests, test sequence and test data are not
randomly selected, variables such as time to detect, or correct, an error,
and  number  of  detections  or  corrections  per  time  interval  may  be  modelled  as 
random  variables. 

The  probability  of  detecting  an  error  is  a  function  of  type  of  test 
plan,  characteristics  of  input  data,  and  number  and  locations  of  errors  in  a 
module.   Prior  to  the  selection  of  a  test  plan  and  input  data,  all  errors 
(1  through  7  in  Set  A,  Figure  1)  have  equal  probability  of  detection  because 
there  is  usually  no  prior  knowledge  concerning  the  presence  of  errors  or 
the  probability  of  error  detection.   Once  the  next  test  is  selected,  those 
errors  falling  outside  the  domain  of  the  test  (Errors  6  and  7  in  Set  B)  now 
have  zero  probability  of  detection.   Those  errors  falling  within  the  domain 
of  the  test  (Errors  1  through  5  in  Set  C)  now  have  non-zero  probability  of 
detection.   Once  the  input  data  are  selected,  the  combination  of  input  data 
and  test  plan  leads  to  the  following  assumed  situation: 

• Error 1, in the set affected by Input Data A, has a detection probability
of one. Errors 2 and 3 in Set D now have zero probability
of  detection  in  this  test  because  the  detection  of  Error  1  will  halt 
the  computer.  The  future  probability  of  detection  of  Errors  2  and  3 
may  be  dependent  upon  the  previous  detection  and  correction  of  Error 
1,  i.e.  the  detection  and  correction  of  one  error  may  allow  another 
error  to  be  detected  because  the  path  to  the  second  error  is  no  longer 
blocked  by  the  first  error. 

•  Errors  4  and  5  in  Set  E  are  not  affected  by  Input  Data  A  and  have  zero 
probability  of  detection. 

Figure 1. Error, test, and input data relationships.

Assuming that Error 1 is corrected and the test repeated, it may now
be possible to detect Errors 2 or 3, if the detection of one or both depends
upon Input Data A and the prior detection and correction of Error 1 (Figure 2).
This is an example of the detection of an error being dependent on the detection
and correction of another error. Conversely, the detection of Errors 2
and 3 may be independent of the use of Input Data A and the prior detection
and correction of Error 1. This is an example of the detection of an error
being independent of the prior detection and correction of another error.

Once  a  second  Input  Data  Set  B  is  chosen,  and  assuming  that  Errors  2 
and  3  have  not  been  detected,  a  new  set  of  errors,  Set  F,  is  affected  and  Set 
G is not affected (Figure 2).

It  should  be  stressed  that  the  above  probabilities  and  dependencies 
(if  any)  among  errors  are  unknown  prior  to  the  execution  of  the  tests.   This 
information  can  be  obtained  only  after  detailed  post  mortem  analysis  of  the 
tests.   Generalizations  concerning  error  dependencies  are  difficult  to  make 
due  to  the  great  variety  of  module  structures.   However,  the  probability  seems 
low  of  having  many  situations  in  which  errors  are  located  in  a  module  in  such 
a  way  that  the  detection  of  Error  2  depends  simultaneously  on  the  removal 
of  Error  1  and  the  use  of  the  same  test  and  input  data  which  was  used  in  the 
detection  of  Error  1.   This  reasoning  leads  to  the  assumption  of  independence 
among  errors  in  the  model  formulation. 

MODEL  FORMULATION 

A  major  objective  of  the  model  is  to  forecast  the  mean  number  of  cumulative 
errors  for  some  future  time   T,   assuming  that  the  forecast  is  made  at  time   t, 
t  <  T,   and  observations  have  been  made  of  the  number  of  errors  which  have 
occurred  in  intervals  of  unit  length,  designated  by  the  index   i,   from  interval 
1 through t. Due to the characteristics of the data, a calendar time scale
is used. Ideally, error counts should be made with respect to computer
operating time intervals. However, the available data contains counts per
calendar time interval. In order to make the count meaningful with respect to
the exposure of a module to testing, a count interval of one week was chosen
because, with respect to the data used in the analysis, modules are tested
for approximately equal computer time durations each week.

Error Detection.

The actual number of errors detected during interval i is denoted by
$x_i$, and the estimate of the number of errors detected in interval i is denoted
by $m_i$. It is assumed that: (1) in accordance with arguments presented earlier,
the number of errors detected in each time interval is independent of the
number of errors detected in any other time interval; (2) the detected error
counts have a probability density function of the same form in each time interval
but with different means; and (3) the mean number of detected errors decreases
from interval to interval as a result of the continuing detection and correction
of original errors. In addition, it is assumed that the rate of error
detection in an interval is proportional to the number of errors in the interval.
Specifically, the detected error process is assumed to be a non-homogeneous
Poisson process with an exponentially decaying intensity function

$$d(i) = \alpha \exp(-\beta i), \quad \alpha > 0, \quad \beta > 0; \tag{1}$$

mean value function

$$D(i) = (\alpha/\beta)[1 - \exp(-\beta i)]; \tag{2}$$

and mean number of errors in each interval i equal to

$$m_i = (\alpha/\beta)\{\exp[-\beta(i-1)] - \exp(-\beta i)\}. \tag{3}$$

The time estimated to detect a cumulative number of errors D is derived
from (2) as

$$i_d = \{\log[\alpha/(\alpha - \beta D)]\}/\beta. \tag{4}$$

The detection rate of errors is given by (1) and the time estimated for the
detection rate to reach the value d is derived from (1) as

$$i_d' = [\log(\alpha/d)]/\beta. \tag{5}$$
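
The detection relationships can be evaluated numerically. The following Python sketch is illustrative only: the function names and the parameter values (alpha = 4.0, beta = 0.05) are assumptions chosen for the example and are not estimates from the NTDS data. Because the mean value function approaches alpha/beta as i grows, the time to detect D cumulative errors is defined only for D less than alpha/beta.

```python
import math


def intensity(i, alpha, beta):
    """Error detection rate d(i), equation (1)."""
    return alpha * math.exp(-beta * i)


def mean_cumulative(i, alpha, beta):
    """Mean cumulative detected errors D(i), equation (2)."""
    return (alpha / beta) * (1.0 - math.exp(-beta * i))


def mean_in_interval(i, alpha, beta):
    """Mean number of errors detected in interval i, equation (3)."""
    return (alpha / beta) * (math.exp(-beta * (i - 1)) - math.exp(-beta * i))


def time_to_detect(total, alpha, beta):
    """Time to detect a cumulative number of errors, equation (4)."""
    if total >= alpha / beta:
        raise ValueError("total must be below the asymptote alpha/beta")
    return math.log(alpha / (alpha - beta * total)) / beta


def time_to_rate(rate, alpha, beta):
    """Time for the detection rate to fall to a given value, equation (5)."""
    return math.log(alpha / rate) / beta


# Hypothetical parameter values, for illustration only.
alpha, beta = 4.0, 0.05
expected_first_ten = sum(mean_in_interval(i, alpha, beta) for i in range(1, 11))
```

Summing the interval means from (3) over intervals 1 through 10 reproduces D(10) from (2), which gives a simple consistency check on the three expressions.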

Error  Correction. 

In  addition  to  detected  errors,  the  cumulative  mean  of  which  is 
given  by  (2),  the  software  quality  control  function  is  also  concerned  with  the 
correction  of  detected  errors.   Assuming  that  resources  are  committed  to  the 
correction  of  errors  in  proportion  to  number  of  errors  detected,  the  cumulative 
mean  corrected  error  function   C(i)   will  have  the  same  form  as  (2)  but  will 
lag (2) by $\Delta i$. This is the time estimated to correct a number of errors
equal to $D(i) - C(i)$. Thus, for $i \ge \Delta i$,

$$C(i) = D(i - \Delta i) = (\alpha/\beta)\{1 - \exp[-\beta(i - \Delta i)]\}. \tag{6}$$

The lag $\Delta i$ can be estimated by finding $\Delta i$ such that the relationship

$$C(t) = D(t - \Delta i) \tag{7}$$

is satisfied from the empirical data, where t is the time of making a forecast.
These relationships are shown in Figure 3. The time estimated to correct
a cumulative number of errors C is derived from (6) as

$$i_c = \Delta i + \{\log[\alpha/(\alpha - \beta C)]\}/\beta. \tag{8}$$

Figure 3. Relationship between cumulative detected D(i)
and corrected C(i) errors.

The correction rate of errors is derived from (6) as

$$c(i) = \alpha \exp[-\beta(i - \Delta i)], \tag{9}$$

for $i \ge \Delta i$. The time estimated for the correction rate to reach the value
c is derived from (9) as

$$i_c' = \Delta i + [\log(\alpha/c)]/\beta. \tag{10}$$

Difference  Between  Detected  and  Corrected  Errors. 

The number of detected errors which has not been corrected after time
i is derived from (2) and (6) as

$$R(i) = D(i) - C(i) = (\alpha/\beta)[\exp(-\beta i)][\exp(\beta \Delta i) - 1], \tag{11}$$

for $i \ge \Delta i$. This relationship is shown in Figure 3. For a given value of
R(i) existing at time i, the time ($\Delta i_r$) estimated to correct R errors
is derived from (11) as

$$\Delta i_r = \{\log[R\beta \exp(\beta i)/\alpha + 1]\}/\beta. \tag{12}$$

The time estimated to reach a given value of R is derived from (11) as

$$i_r = \{\log[\alpha(\exp(\beta \Delta i) - 1)/(R\beta)]\}/\beta. \tag{13}$$

All of the above expressions except (3), which is used for parameter
estimation  as  described  in  the  next  section,  may  be  used  as  quantitative  aids 
in  software  quality  control  and  assurance  functions. 
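
The correction and backlog relationships (6) through (13) can be sketched in the same way. The Python below is a hedged illustration, not part of the original model description: the function names and parameter values are hypothetical, and the lag estimate assumes that (7) is read as C(t) = D(t - lag), so that the lag follows from inverting (2) at the observed cumulative corrected count.

```python
import math


def mean_cumulative_detected(i, alpha, beta):
    """D(i), equation (2)."""
    return (alpha / beta) * (1.0 - math.exp(-beta * i))


def mean_cumulative_corrected(i, alpha, beta, lag):
    """C(i) = D(i - lag), equation (6); valid for i >= lag."""
    if i < lag:
        raise ValueError("equation (6) applies only for i >= lag")
    return mean_cumulative_detected(i - lag, alpha, beta)


def estimate_lag(t, corrected_so_far, alpha, beta):
    """Lag satisfying C(t) = D(t - lag), assuming that reading of (7)."""
    detect_time = math.log(alpha / (alpha - beta * corrected_so_far)) / beta
    return t - detect_time


def uncorrected(i, alpha, beta, lag):
    """R(i): detected but uncorrected errors, equation (11)."""
    return (alpha / beta) * math.exp(-beta * i) * (math.exp(beta * lag) - 1.0)


def time_to_clear(backlog, i, alpha, beta):
    """Time to correct a backlog existing at time i, equation (12)."""
    return math.log(backlog * beta * math.exp(beta * i) / alpha + 1.0) / beta


def time_to_reach_backlog(backlog, alpha, beta, lag):
    """Time at which the backlog falls to a given value, equation (13)."""
    return math.log(alpha * (math.exp(beta * lag) - 1.0) / (backlog * beta)) / beta


# Hypothetical values, for illustration only.
alpha, beta, lag = 4.0, 0.05, 3.0
backlog_at_10 = uncorrected(10, alpha, beta, lag)
```

With a constant lag, applying (12) to the backlog at any time returns the lag itself, which is a useful check on an implementation.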

ERROR FORECASTING METHODOLOGY

This  section  will  describe  the  methodology  used  for  estimating  the  error 
detection and correction function parameters $\alpha$ and $\beta$. Once these parameters
have  been  estimated,  forecasts  are  made  of  cumulative  detected  and  corrected 
errors  in  the  forecast  interval   t+1   through  T.   With  forecasts  available, 
the  type  of  analyses  described  in  the  previous  section  can  be  made. 

Parameter estimates are made by using the criteria of maximum likelihood
and weighted least squares. The first approach was to use the method of
maximum likelihood only and to use all error count observations $x_i$ in the
interval 1 through t. It was found that forecasting accuracy, as computed
by the sum of squared deviations in the interval t+1 through T, was unacceptable
for error prediction purposes. This result is caused by differences between
the actual and model processes which occur over an extended period of time.
That is, $\alpha$ and $\beta$ appear to be constant only over restricted time intervals,
and vary when the observation and forecast intervals are long. A further problem
is that the time of error observation is not recorded on NTDS software
trouble reports. The date information provided on the trouble report is date
of report preparation rather than date of observation. In many cases the two
dates are the same; in other cases there is a time lag between observation and
report preparation and, hence, a difference in dates. The extent of time
recording discrepancies cannot be obtained from the data originator. This
source of error does not appear to be a major contributor to forecast inaccuracies.
Since, in general, software error data is very difficult to obtain,
whereas  the  supply  of  NTDS  data  is  plentiful,  the  approach  which  has  been  taken 
is  that  the  model  must  be  capable  of  utilizing  a  certain  amount  of  imprecise 
data and still provide reasonable forecasting accuracy.

Since the error process apparently changes over time, recent observations
are generally more useful than earlier observations, although this will not be
the case in every forecasting situation. Therefore, when selecting a forecasting
methodology, methods which provide for unequal representation of error
counts  in  the  forecast  are  considered  in  addition  to  a  method  which  uses  all 
counts  on  an  equal  basis.   In  order  to  accomplish  this  objective,  it  was 
necessary  to  develop  criteria  for  determining: 

(1) to what extent historical observations would be considered in
forecasting, and

(2) how much of the historical time record to include when estimating
$\alpha$ and $\beta$.

Three  methods  were  developed: 

(1)  All  of  the  error  counts  in  intervals  from  1   through   t  are  used. 

(2) None of the error counts in intervals from 1 through s-1 are used,
where s is an index, with unit increment, $2 \le s \le t$; all of the
counts in intervals from s through t are used.

(3) The cumulative error count from intervals 1 through s-1 is used;
individual error counts in intervals from s through t are used.

Method (1) is appropriate if changes in error count from interval 1
through t, for all intervals, are representative of the future ability to
detect  errors.   Method  (2)  is  appropriate  if  the  most  recent  error  count  changes 
from  interval  s   through   t  are  representative  of  the  future  ability  to  detect 
errors.   Finally,  Method  (3)  is  intermediate  to  (1)  and  (2),  and  is  appropriate 
if  the  individual  error  count  changes  from  interval   1   through   s-1  are  not 
representative  of  the  future  ability  to  detect  errors,  but  the  total  error 
count  from  1   through  s-1  and  the  changes  in  error  count  from  s   through 
t  are  representative. 
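
The difference between the three methods lies in which observations enter the estimate. A minimal Python sketch follows, assuming the weekly counts are held in a list whose first element is interval 1. The function name and the sample counts are hypothetical.

```python
def observations_for_method(counts, method, s=1):
    """Return (pooled_count, individual_counts) used by a given method.

    pooled_count is the single cumulative count for intervals 1..s-1
    (Method 3 only); individual_counts are the per-interval counts used.
    """
    if method == 1:
        # All counts from interval 1 through t are used individually.
        return None, list(counts)
    if not 2 <= s <= len(counts):
        raise ValueError("s must satisfy 2 <= s <= t")
    if method == 2:
        # Counts before interval s are discarded.
        return None, list(counts[s - 1:])
    if method == 3:
        # Counts before interval s are pooled into one cumulative count.
        return sum(counts[:s - 1]), list(counts[s - 1:])
    raise ValueError("method must be 1, 2 or 3")


# Hypothetical weekly error counts, for illustration only.
weekly = [6, 5, 3, 4, 2, 1]
method2 = observations_for_method(weekly, 2, s=3)
method3 = observations_for_method(weekly, 3, s=3)
```

Here Method (2) keeps only the counts from interval 3 onward, while Method (3) additionally retains the total of the first two intervals as a single observation.
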

Method (1) involves applying the method of maximum likelihood to all
error counts $x_i$ from interval 1 through t. Methods (2) and (3) require
a criterion for selecting s. This was accomplished by first estimating $\alpha$
and $\beta$ (by the method of maximum likelihood), for each value of s from 2
through t, and then computing the sum of weighted squared deviations $SD_w$
between error estimates $m_i$ and actual errors $x_i$ from interval 1 through
t, where the intervals are of equal length and the summation for each value
of s is computed over the same number of intervals. The best value of s in
Methods (2) and (3) is determined by choosing the value s, and corresponding
positive values of $\alpha$ and $\beta$, that produce the minimum value of $SD_w$. The
three methods can be evaluated by comparing values of SD computed from
unweighted squared deviations between forecasted errors and actual errors in
the interval t+1 through T.

Weighted  Least  Squares  Criterion. 

The variance of the deviation $e_i$ between estimated errors $m_i$ and
actual errors $x_i$, when zero bias is assumed, is equal to the variance of $m_i$,
as shown by

$$\mathrm{Var}(e_i) = E[(m_i - x_i)^2] = E(m_i^2) - x_i^2, \tag{14}$$

$$\mathrm{Var}(m_i) = E(m_i^2) - [E(m_i)]^2 = E(m_i^2) - x_i^2. \tag{15}$$

This variance is also given by

$$\mathrm{Var}(e_i) = E(e_i^2) - [E(e_i)]^2 = E(e_i^2), \tag{16}$$

assuming that $E(e_i) = 0$. Also, since the process is assumed to be Poisson,
the mean and variance are equal. Combining this fact with (3), (14), (15)
and (16), we have

$$\mathrm{Var}(e_i) = \mathrm{Var}(m_i) = E(e_i^2) = (\alpha/\beta)[\exp(-\beta i)][1 - \exp(-\beta)]. \tag{17}$$

In order that $\mathrm{Var}(e_i) = E(e_i^2)$ be constant, as required by the method of
least squares, (17) must be multiplied by the term $\exp(\beta i)$ in order to
eliminate its time-varying term. Thus,

$$\mathrm{Var}(e_i') = E[(e_i')^2] = \exp(\beta i)\,\mathrm{Var}(e_i) = \exp(\beta i)\,E(e_i^2), \tag{18}$$

where $e_i'$ is a constant variance deviation term. Since $E(e_i^2) = E[(m_i - x_i)^2]$,
this fact combined with (18) gives

$$E[(e_i')^2] = \exp(\beta i)\,E(e_i^2) = \exp(\beta i)\,E\{\{(\alpha/\beta)[\exp(-\beta i)][1 - \exp(-\beta)] - x_i\}^2\}. \tag{19}$$

Thus, in order to select the best values $\alpha^*$ and $\beta^*$ in Methods (2) and (3),
by the least squares criterion, involving the minimization of $\sum_i (e_i')^2$, we
find $s = s^*$ such that

$$SD_w = \sum_{i=1}^{t} \exp(\beta i)\{(\alpha/\beta)[\exp(-\beta i)][1 - \exp(-\beta)] - x_i\}^2 \tag{20}$$

is minimized. In order to compare Methods (1), (2) and (3), SD (unweighted)
is computed over the interval from t+1 through T. The method which produces
the lowest values of SD for positive values of $\alpha$ and $\beta$ is the preferred
method.
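
The weighted criterion and the search over s can be sketched as follows. This Python is an illustration under stated assumptions, not the original implementation: the function names are hypothetical, and `fit` is a caller-supplied estimator returning (alpha, beta) for a given s. The estimate term follows the weighted sum (20) as printed, which uses exp(-beta i), whereas factoring (3) gives exp(-beta (i-1)); the two differ by the constant factor exp(beta). How parameters fitted under Method (2) are re-indexed to intervals 1 through t is not addressed here.

```python
import math


def weighted_sd(counts, alpha, beta):
    """Sum of weighted squared deviations SD_w, following (20) as printed."""
    total = 0.0
    for i, x in enumerate(counts, start=1):
        estimate = (alpha / beta) * math.exp(-beta * i) * (1.0 - math.exp(-beta))
        total += math.exp(beta * i) * (estimate - x) ** 2
    return total


def unweighted_sd(forecasts, actuals):
    """Unweighted SD used to compare methods over intervals t+1 through T."""
    return sum((m - x) ** 2 for m, x in zip(forecasts, actuals))


def select_start(counts, fit):
    """Return (s, alpha, beta, sd_w) with minimum SD_w over s = 2..t.

    fit(counts, s) is a hypothetical estimator supplied by the caller;
    candidates with non-positive alpha or beta are skipped.
    """
    best = None
    for s in range(2, len(counts) + 1):
        alpha, beta = fit(counts, s)
        if alpha <= 0 or beta <= 0:
            continue
        score = weighted_sd(counts, alpha, beta)
        if best is None or score < best[3]:
            best = (s, alpha, beta, score)
    return best
```

Keeping the estimator as a parameter lets the same search serve Methods (2) and (3), which differ only in how the parameters are estimated for each s.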

Maximum  Likelihood  Estimation  of  Parameters. 

The parameters $\alpha$ and $\beta$ are estimated by the method of maximum likelihood.
For Method (3), the estimate is made with respect to a starting interval
s for making observations of the number of errors per time interval. The
relationship between time intervals and error counts is shown in Figure 4.

Figure 4. Relationship between time intervals and error counts.

The development of the likelihood function $L(\alpha,\beta,s)$, where the constant
factors $x_i!$ are ignored in the density functions, for estimating parameters
for Method (3) follows. This is also the formulation for Method (1) when s = 1.

$$L(\alpha,\beta,s) = [\exp(-M_{s-1})\,M_{s-1}^{X_{s-1}}][\exp(-m_s)\,m_s^{x_s}][\exp(-m_{s+1})\,m_{s+1}^{x_{s+1}}] \cdots [\exp(-m_{s+k})\,m_{s+k}^{x_{s+k}}] \cdots [\exp(-m_t)\,m_t^{x_t}] \tag{21}$$

where $M_{s-1}$ is the mean number of errors in the interval 1 through s-1;
$m_s$, $m_{s+1}$, $m_{s+k}$ and $m_t$ are the mean number of errors in intervals s, s+1,
s+k and t, respectively; and $X_{s-1}$, $x_s$, $x_{s+1}$, $x_{s+k}$ and $x_t$ are the
corresponding numbers of errors. The mean number of errors, based on the
assumption of a Poisson distribution in each interval, is as follows:

$$M_{s-1} = (\alpha/\beta)[\exp(0) - \exp(-(s-1)\beta)] = (\alpha/\beta)[1 - \exp(-(s-1)\beta)]$$

$$m_s = (\alpha/\beta)[\exp(-(s-1)\beta) - \exp(-s\beta)] = (\alpha/\beta)[\exp(-(s-1)\beta)][1 - \exp(-\beta)]$$

$$m_{s+1} = (\alpha/\beta)[\exp(-s\beta) - \exp(-(s+1)\beta)] = (\alpha/\beta)[\exp(-s\beta)][1 - \exp(-\beta)]$$

$$m_{s+k} = (\alpha/\beta)[\exp(-(s+k-1)\beta) - \exp(-(s+k)\beta)] = (\alpha/\beta)[\exp(-(s+k-1)\beta)][1 - \exp(-\beta)]$$

$$m_t = (\alpha/\beta)[\exp(-(t-1)\beta) - \exp(-t\beta)] = (\alpha/\beta)[\exp(-(t-1)\beta)][1 - \exp(-\beta)]$$

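Substituting the foregoing in (21) and taking the natural logarithm
gives

$$\begin{aligned}
\log L ={}& -(M_{s-1} + m_s + m_{s+1} + \cdots + m_{s+k} + \cdots + m_t) + X_{s-1}\log M_{s-1} \\
& + x_s \log m_s + x_{s+1}\log m_{s+1} + \cdots + x_{s+k}\log m_{s+k} + \cdots + x_t \log m_t \\
={}& (\alpha/\beta)[\exp(-\beta t) - 1] + X_{s-1}[\log(\alpha/\beta) + \log(1 - \exp(-(s-1)\beta))] \\
& + x_s[\log(\alpha/\beta) - (s-1)\beta + \log(1 - \exp(-\beta))] + x_{s+1}[\log(\alpha/\beta) - s\beta + \log(1 - \exp(-\beta))] + \cdots \\
& + x_{s+k}[\log(\alpha/\beta) - (s+k-1)\beta + \log(1 - \exp(-\beta))] + \cdots + x_t[\log(\alpha/\beta) - (t-1)\beta + \log(1 - \exp(-\beta))].
\end{aligned}$$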


Taking the partial derivatives $\partial \log L/\partial\alpha$ and $\partial \log L/\partial\beta$ and setting these equal to zero, we have

\[
\alpha/\beta = \frac{X_t}{1 - \exp(-\beta t)} \tag{22}
\]

and

\[
\frac{(s-1)X_{s-1}}{\exp((s-1)\beta) - 1} + \frac{X_{s,t}}{\exp(\beta) - 1} - \frac{t X_t}{\exp(\beta t) - 1} = \sum_{k=0}^{t-s} (s+k-1)\, x_{s+k} \tag{23}
\]

where $X_{s,t}$ and $X_t$ are the numbers of errors detected in the intervals s to t and 1 to t, respectively. When s = 1, (23) reduces to

\[
\frac{1}{\exp(\beta) - 1} - \frac{t}{\exp(\beta t) - 1} = \frac{1}{X_t}\sum_{k=0}^{t-1} k\, x_{k+1} \tag{24}
\]

This is the result that would be obtained if the method of maximum likelihood were applied to $x_1, x_2, x_3, \ldots, x_t$ (the individual error counts in every interval from 1 through t). It should be noted that when s = 1 or s = 2 is substituted in (23), the same result (24) is obtained.
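The s = 2 case can be checked in a few lines. The following is a worked sketch, not part of the original derivation; it uses only the facts that for s = 2 the pooled count $X_{s-1}$ is the single count $x_1$ (written $X_1$ below) and that $X_1 + X_{2,t} = X_t$.

```latex
% Setting s = 2 in (23), then re-indexing the sum with j = k + 1.
\begin{align*}
\frac{X_1}{\exp(\beta)-1} + \frac{X_{2,t}}{\exp(\beta)-1} - \frac{t X_t}{\exp(\beta t)-1}
  &= \sum_{k=0}^{t-2} (k+1)\, x_{k+2} \\
\frac{X_t}{\exp(\beta)-1} - \frac{t X_t}{\exp(\beta t)-1}
  &= \sum_{j=0}^{t-1} j\, x_{j+1}
  \qquad (j = k+1;\ \text{the } j = 0 \text{ term is zero}) \\
\frac{1}{\exp(\beta)-1} - \frac{t}{\exp(\beta t)-1}
  &= \frac{1}{X_t} \sum_{j=0}^{t-1} j\, x_{j+1}
\end{align*}
```

The last line is (24), so pooling only the first interval discards no information.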

In order to estimate $\alpha$ and $\beta$ in accordance with Method (2), all error counts in the interval 1 through s-1 are ignored. This is accomplished by substituting t - s + 1 for t in (24), in order to account for the fact that there are only t - s + 1 intervals to consider when the first s-1 intervals are ignored. In addition, the subscript of x in the summation of (24) must be adjusted to make $x_s$ the first error count. The resulting expression is

\[
\frac{1}{\exp(\beta) - 1} - \frac{t-s+1}{\exp(\beta(t-s+1)) - 1} = \frac{1}{X_{s,t}}\sum_{k=0}^{t-s} k\, x_{s+k} \tag{25}
\]
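Equations (23) and (25) have no closed-form solution for $\beta$, so they must be solved numerically. The following is a minimal sketch of one way to do so, assuming simple bisection (the paper does not say which root-finding method was used). The function names `expected_counts`, `bisect_decreasing`, `fit_method3` and `fit_beta_method2` are hypothetical, and the counts in the usage note are noise-free synthetic values computed from the interval-mean formulas, not NTDS data.

```python
import math


def expected_counts(alpha, beta, t):
    """Noise-free interval means m_1..m_t from the interval-mean formulas."""
    return [(alpha / beta) * math.exp(-(i - 1) * beta) * (1.0 - math.exp(-beta))
            for i in range(1, t + 1)]


def bisect_decreasing(g, lo=1e-6, hi=10.0, iters=200):
    """Root of a decreasing function g on (lo, hi), found by bisection."""
    if g(lo) <= 0.0 or g(hi) >= 0.0:
        # No sign change: the counts do not support a positive beta.
        raise ValueError("no positive beta in the search bracket")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fit_method3(x, s):
    """Solve (23) for beta, then (22) for alpha.

    x[i-1] is the error count in interval i; s = 1 gives Method (1).
    """
    t = len(x)
    X_prev = sum(x[:s - 1])      # X_{s-1}: pooled count for intervals 1..s-1
    X_st = sum(x[s - 1:])        # X_{s,t}: count for intervals s..t
    X_t = X_prev + X_st          # X_t: count for intervals 1..t
    # Right side of (23): sum over k of (s+k-1) * x_{s+k}
    rhs = sum((i - 1) * x[i - 1] for i in range(s, t + 1))

    def g(beta):
        lhs = X_st / math.expm1(beta) - t * X_t / math.expm1(beta * t)
        if s > 1:  # the pooled term vanishes when s = 1
            lhs += (s - 1) * X_prev / math.expm1((s - 1) * beta)
        return lhs - rhs

    beta = bisect_decreasing(g)
    alpha = beta * X_t / (1.0 - math.exp(-beta * t))  # equation (22)
    return alpha, beta


def fit_beta_method2(x, s):
    """Solve (25) for beta using only the counts in intervals s..t."""
    tail = x[s - 1:]
    n = len(tail)                # t - s + 1 intervals remain
    rhs = sum(k * xk for k, xk in enumerate(tail)) / sum(tail)

    def g(beta):
        return 1.0 / math.expm1(beta) - n / math.expm1(beta * n) - rhs

    return bisect_decreasing(g)
```

As a usage check, `fit_method3(expected_counts(2.70, 0.066, 20), s)` should return the generating parameters for any s, since noise-free means satisfy the likelihood equations exactly. When the counts show no decreasing trend, the bracket test fails and a `ValueError` is raised instead of a non-positive $\beta$ being returned.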

RESULTS 

A limited test of the validity of the model was made by evaluating the forecasting accuracy against software error data from an NTDS module. This module had a total of 160 detected errors over a period of 132 weeks. The first 20 weeks, involving 30 detected errors, were used to make a variety of forecasts, using the model equations and forecasting methodology previously described. Forecasts of cumulative detected and corrected errors were made for weeks 21-30, 21-40 and 21-50 and compared with actual values. Forecasts were also made of the time required to detect or correct a specified number of errors. In addition, an evaluation was made of the accuracy of the three forecasting methodologies: Methods (1), (2) and (3). The composition of the 30 errors in the observation period (weeks 1-20) is as follows:

Source                  Severity

Production   30%        Low       3%
Test         70         Medium   37
                        High     60
            100                 100

The composition of the 53 errors during the forecast period (weeks 21-50) is as follows:

Errors: Stages          Errors: Severity

Production    8%        Low       6%
Test         79         Medium   31
User         13         High     63
            100                 100

Production  errors  are  those  errors  detected  during  the  time  the  software 
contractor  has  cognizance  of  the  software.   Test  errors  are  those  errors 
detected  by  the  customer  during  functional  test,  after  the  software  has  been 
delivered  by  the  contractor  to  the  customer.   User  errors  are  those  errors 
detected  by  the  customer  after  functional  testing  has  been  completed  and  the 
software  has  been  put  into  operational  use.   The  High,  Medium  and  Low  categories 
are the error severity classifications used in NTDS trouble reports, as previously described. Although forecasts can be made by category, this was not done in
the  analysis  being  discussed  because,  for  the  initial  evaluation,  it  was  desired 
to  use  the  maximum  amount  of  data. 

In general, the errors used in this analysis were detected under conditions which closely approximate a functional testing environment, the type of
testing  for  which  the  model  is  applicable.   Also,  the  composition  of  errors 
in  the  observation  period  is  reasonably  representative  of  the  composition  of 
errors  in  the  forecast  periods.   Descriptions  follow  of  the  various  analyses 
which  were  performed. 
Forecast  Error  as  a  Function  of  s. 

This analysis was performed in order to determine whether forecast error varies as a function of s, the first interval where individual detected software error counts are used for forecasting in Methods (2) and (3). Figure 5 shows forecast error as a function of s for three forecast periods for Method (1), s = 1, and Method (3), s = 2-15. Method (2) is not shown because a number of negative values of $\beta$ were generated as s was varied. Negative values of $\beta$ have no meaning because the model is based on the assumption that the intensity function is a decreasing function, at least over an extended period of time, and that a counter trend would be attributable to errors in observation, or to the introduction of new software errors as a result of attempting to correct another software error. Method (1) is included on Figure 5 because it corresponds to s = 1 in Method (3). It is seen that, for the module tested, forecast error varies considerably with s for the more distant forecast periods and that using the maximum amount of data does not provide the greatest accuracy. Whereas forecast error is relatively insensitive to s for short range forecasting, it is very sensitive to s (decreases with increasing s) for long range forecasting. Apparently, the process changes over time, and recent software error counts are more representative of the changing process than are early error counts.

Figure 5. Sum of forecast deviations squared $SD_f$ vs s for detected errors
Forecast  Error  and  Weighted  Least  Squares  Criterion. 

It  was  of  interest  to  test  the  validity  of  the  weighted  least  squares 
criterion. This criterion is used to select particular values of $\alpha$ and $\beta$
from the set of paired values obtained from maximum likelihood parameter estimates
of the detected software error process. Validity was tested by determining
how  well  forecast  error  (as  measured  by  the  sum  of  unweighted  deviations  squared 
in  the  forecast  period)  associates  with  the  sum  of  weighted  deviations  squared 
in  the  observation  period.   The  values  shown  in  Figure  6  were  obtained  by  using 
various  values  of  s   (2-15)  for  Method  (3).   The  positive  association  is 
reasonably  good.   Also  shown  are  plots  of  forecast  error  versus  the  sum  of 
unweighted  squared  deviations  in  the  observation  period.   It  is  seen  that  this 
association  is  basically  negative.   Hence,  as  would  be  expected,  a  weighted 
least  squares  criterion  is  superior  to  an  unweighted  least  squares  criterion 
for  the  type  of  model  being  analyzed  (variance  of  error  count  decreases 
exponentially  with  time). 
Forecast  Error  and  Forecast  Methodology. 

Cumulative forecast error is plotted against forecast week for detected software errors for the three methods in Figure 7. In the case of Methods (2) and (3), the error function corresponding to the best value of s is plotted. This is the value which produces minimum error during the observation period, weeks 1-20. The three methods are also compared in Figure 8, where cumulative actual and forecasted (using equation (2)) detected errors are plotted. Again, the best values of s are used for Methods (2) and (3). In general, forecasting accuracy is greater for Methods (2) and (3), because recent software error observations are more representative of the changing software error detection process than are early observations.

Figure 6. Forecast deviations $SD_f$ vs weighted observation deviations $SD_{ow}$ and vs unweighted observation deviations $SD_o$ for detected errors. (Method (3), s = 2-15; observation period: weeks 1-20; forecast periods: weeks 21-30, 21-40 and 21-50. $SD_f$: sum of unweighted deviations squared in forecast period; $SD_{ow}$: sum of weighted deviations squared in observation period; $SD_o$: sum of unweighted deviations squared in observation period.)

Figure 7. Sum of forecast deviations squared $SD_f$ vs forecast week i for detected errors. (Method (1): s = 1, $\alpha$ = 2.70, $\beta$ = .066; Method (2): s = 19 (best value); Method (3): s = 15 (best value); $\alpha$ = 1.62, $\beta$ = .008; observation period: weeks 1-20.)

Figure 8. Cumulative actual and forecasted detected errors D(i) vs week i. (Curves: actual errors; Method (1): s = 1, $\alpha$ = 2.70, $\beta$ = .066; Method (2): s = 19 (best value); Method (3): s = 15 (best value); $\alpha$ = 1.62, $\beta$ = .008; observation period: weeks 1-20; forecast period: weeks 21-50.)

Detected  and  Corrected  Errors. 

A test was made of the validity of the forecasted cumulative corrected error function, as given by equation (6). This was accomplished by first obtaining an estimate of $\Delta i$, the lag between error detection and correction, by using the empirical data and equation (7). Then, using the best values of $\alpha$ and $\beta$, corresponding to the best value of s, equation (6) was used to forecast cumulative corrected errors, and the forecast is compared with actual cumulative corrected errors in Figure 9. In addition, in order to show the contrast between detected and corrected errors, cumulative forecasted and actual detected errors are plotted in Figure 9. As would be expected, the accuracy of corrected error forecasts is less than the accuracy of detected error forecasts because, in addition to estimates of $\alpha$ and $\beta$, it is also necessary to make an estimate of $\Delta i$.

Time  to  Detect  Errors. 

The forecasted time $i_d$ to detect a specified number of cumulative errors, as given by equation (4), was computed for five actual values of D(i) and compared with the actual detect time i in Table I. Forecast accuracy is good for short range forecasts but decreases for long range forecasts.
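The forecast can be illustrated with a short sketch. Equations (2) and (4) lie outside this section, so the forms used here are assumptions: the cumulative mean $D(i) = (\alpha/\beta)[1 - \exp(-\beta i)]$ is the one implied by the interval means given earlier, and the time to detect is its inverse. The function names are hypothetical; the parameter values are those listed with Table I.

```python
import math

ALPHA, BETA = 1.62, 0.008   # parameter values listed with Table I


def cumulative_detected(i, alpha=ALPHA, beta=BETA):
    """Assumed cumulative mean: D(i) = (alpha/beta) * (1 - exp(-beta * i))."""
    return (alpha / beta) * (1.0 - math.exp(-beta * i))


def time_to_detect(d, alpha=ALPHA, beta=BETA):
    """Invert the assumed D(i): week at which the forecast reaches d errors."""
    frac = beta * d / alpha
    if not 0.0 <= frac < 1.0:
        raise ValueError("d must lie below the asymptote alpha/beta")
    return -math.log(1.0 - frac) / beta


if __name__ == "__main__":
    for d in (39, 48, 64, 77, 83):   # actual D(i) values from Table I
        print(d, round(time_to_detect(d), 1))
```

Under these assumptions, D = 39 gives about 26.7 weeks, in line with the first forecast in Table I. The later rows agree only to within a fraction of a week, which may reflect rounding in the quoted parameter values.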

Time  to  Correct  Cumulative  Errors. 

The forecasted time $i_c$ to correct a specified number of cumulative errors, as given by equation (8), was computed for five actual values of C(i) and compared with the actual time to correct i in Table I.

TABLE I

COMPARISON OF ACTUAL AND FORECASTED ERROR VALUES

   i       D(i)     $i_d$       C(i)     $i_c$             R(i)              $\Delta i_r$
(Actual) (Actual) (Forecast)  (Actual) (Forecast)  (Actual) (Forecast)  (Actual) (Forecast)

   30       39       26.7        17       25.0        22      18.9        16.7     16.1
   35       48       33.7        31       34.7        17      18.2        14.2     13.1
   40       64       47.2        31       34.7        33      17.6        18.4     25.2
   45       77       59.3        31       34.7        46      16.9        20.3     35.0
   50       83       65.3        52       50.9        31      16.2        22.5     25.5

i             actual time in weeks
D(i)          actual number of cumulative detected errors
$i_d$         forecasted time to detect D(i) number of cumulative errors (in weeks)
C(i)          actual number of cumulative corrected errors
$i_c$         forecasted time to correct C(i) number of cumulative errors (in weeks)
R(i)          actual and forecasted number of detected but uncorrected errors
$\Delta i_r$  actual and forecasted time required to correct R(i) number of errors

lag in correcting errors = 14 weeks
$\alpha$ = 1.62, $\beta$ = .008

Number of Detected but Uncorrected Errors.

The  number  of  detected  but,  as  yet,  uncorrected  (remaining)  errors 
R(i),   which  would  exist  at  time   i,   was  forecasted  for  five  values  of   i, 
by  using  equation  (11),  and  compared  with  the  actual  number  of  such  errors  in 
Table  I.   Forecast  accuracy  is  good  for  short  range  forecasts  but  decreases 
for  long  range  forecasts.   It  is  to  be  expected  that  the  accuracy  of  R(i) 
forecasts  will  not  be  as  great  as  the  accuracy  of  D(i)   and  C(i)   forecasts, 
because   R(i)   is  a  function  of  both   D(i)   and   C(i). 
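The dependence of R(i) on both D(i) and C(i) can be made concrete with a sketch. Equations (6), (8) and (11) lie outside this section, so the forms below are assumptions: corrections are taken to trail detections by a constant lag, so that C(i) = D(i - lag) and R(i) = D(i) - C(i), and the time to correct is the time to detect plus the lag. The function names are hypothetical; the parameter values and the 14-week lag are those listed with Table I.

```python
import math

ALPHA, BETA, LAG = 1.62, 0.008, 14.0   # values listed with Table I


def detected(i):
    """Assumed cumulative detected errors D(i); zero before testing starts."""
    if i <= 0:
        return 0.0
    return (ALPHA / BETA) * (1.0 - math.exp(-BETA * i))


def corrected(i):
    """Assumed cumulative corrected errors: detections delayed by LAG weeks."""
    return detected(i - LAG)


def remaining(i):
    """Detected but uncorrected errors: R(i) = D(i) - C(i)."""
    return detected(i) - corrected(i)


def time_to_correct(c):
    """Week by which c errors are forecast to be corrected (assumed form)."""
    frac = BETA * c / ALPHA
    if not 0.0 <= frac < 1.0:
        raise ValueError("c must lie below the asymptote ALPHA/BETA")
    return LAG - math.log(1.0 - frac) / BETA


if __name__ == "__main__":
    for week in (30, 35, 40, 45, 50):   # weeks used in Table I
        print(week, round(remaining(week), 1))
```

Under these assumptions the forecasts of R(i) and $i_c$ come out close to the values in Table I, which suggests, but does not confirm, that the original equations have this form. Because the forecast R(i) is a difference of two forecasts, errors in either one carry through to it.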
Time  to  Correct  Remaining  Errors. 

The time $\Delta i_r$ to correct a given number of detected but uncorrected errors was forecasted for five values of R(i), by using equation (12), and compared with the actual time in Table I.

CONCLUSIONS 

Since  only  a  single  software  module  was  analyzed,  although  one  with  a  large 
number  of  reported  software  troubles,  it  would  be  inappropriate  to  generalize 
the  results  to  the  universe  of  NTDS,  and  certainly  to  that  of  other  types  of 
software. However, based on (unpublished) analysis by the author of approximately
thirty NTDS modules, the great similarity in the characteristics
(amplitude and shape) of the time series of detected errors among modules suggests
the applicability of the model to NTDS software in general and, possibly,
to  other  large  scale  software  production  activities.   On  the  basis  of  limited 
experience,  the  forecasting  accuracy  of  the  various  types  of  predictors  seems 
adequate,  and  the  predictors  could  be  employed  as  decision  aids  in  software 
testing management.

Additional areas for research are: (1) the validation of the model
against a large number of NTDS modules, (2) validation of the model against
other large scale software testing error data and (3) the collection and use
of software testing resources (manpower, computers, money) utilization data,
in conjunction with this model, for the development of functions which would
indicate the trade-off between software quality improvement and resource
expenditure.